METHOD OF IMPLEMENTING THE DATA ENCRYPTION 
STANDARD WITH REDUCED COMPUTATION 

BACKGROUND OF THE INVENTION 

5 

1. Field of the Invention 

The present invention relates generally to encryption and decryption. The invention 
relates more specifically to a unique software implementation of the round function of the 
Data Encryption Standard to reduce required computation. 

10 

2. Description of the Related Art 

An encryption algorithm provides data confidentiality by disguising data such that 
an outsider that obtains the disguised data cannot recover the original data with a 
reasonable amount of time and effort. The Data Encryption Standard (DES), described in 

1 5 United States Patent No. 3,962,539, is a highly popxxlar symmetric-key encryption 

algorithm. The DES encryption algorithm accepts a 64-bit plaintext block P and a 56-bit 
key K as inputs, and the algorithm outputs a 64-bit ciphertext block C. The plaintext 
block P is the data to be disguised, and the ciphertext block C is the disguised result. 
Similarly, the DES decryption algorithm accepts a 64-bit ciphertext block C and the 56- 

20 bit secret key K as inputs and the decryption algorithm outputs the 64-bit plaintext block 
P. Figs. 1 and 2 illustrate DES decryption and encryption, respectively. Encryption is 
denoted as DES and decryption is denoted as DES"^ 

DES is a symmetric-key cipher because the decryption key is equivalent to the 
encryption key. If DES is secure, an outsider caonot easily recover P given C without 

25 knowledge of the secret key K. In addition, an outsider cannot easily discover the secret 
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key K given a plaintext block P and the corresponding ciphertext block C encrypted under 
K. 

In the context of available computing power, DBS suffers from insufficient key 
length. Given a plaintext P and the corresponding ciphertext C for some key K, an 
5 outsider can recover the secret key K by observing the results of the DBS encryption of P 
using all possible values for K. This brute-force attack can be completed in a short period 
of time using a reasonable amount of computer hardware. To prevent such an attack, 
some communications and storage security systems employ Triple DBS (3DBS). Triple 
DBS provides a larger effective key length than DBS by sequentially encrypting a 

1 0 plaintext block with DBS three times using three different keys, 

DBS encrypts a plaintext block in three steps, as illustrated in Fig. 3. First, DBS 
performs a fixed initial permutation on the bits of the 64-bit plaintext block. The result of 
this permutation is then subjected to 16 identical rounds of permutation and substitution 
operations. The jth DBS round, where 1 < j < 16, employs a 48-bit round key RKQ) that 

1 5 is deterministically generated from K. Lastly, DBS performs a fixed final permutation on 
the output of the sixteenth round. The result of this final permutation is the 64-bit 
ciphertext block. Decryption proceeds in the same manner, but the 16 round keys 
generated from K are used in reverse order. More specifically, the jth round in DBS 
decryption uses the round key RK(17 - j). 

20 Fig, 4 depicts the jth DBS encryption round. The values L(j) and RQ) are the 

leftmost 32 bits and the rightmost 32 bits of the 64-bit input to the jth round, respectively. 
K is the 56-bit DBS secret key. The encryption round proceeds as follows. R(j) is 
subjected to the Bxpansion Permutation, a fixed mapping of the 32 bits of RG) to a 48-bit 
output. Since the number of bits in the output is greater than that of the input, some input 
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bits are mapped to multiple output bits. The 48-bit result of the Expansion Permutation i; 
then XORed with the output of the round key generation function F. F is a nonlinear 
function that accepts the 56-bit DES secret key K and the round number j as inputs; F 
outputs a 48-bit resuh, RK(j), which is the round key. The result of the XOR operation 
5 between the round key and the output of the Expansion Permutation is then divided into 
eight 6-bit blocks. These 6-bit blocks are applied as inputs to the eight DES S-boxes. 
Each S-box accepts a 6-bit input and outputs a 4-bit result, and the S-boxes represent 
fixed nonlinear functions of the input bits. The 32 output bits of the S-boxes are then 
subjected to the DES P-box Permutation. The P-box Permutation is a fixed bijective 
1 0 permutation that maps 32 input bits to 32 output bits. Lastly, the output of the P-box is 
XORed with LQ), and this 32-bit result is R(j + 1) m the next round. The value of L(j + 
1) in the next round is simply the value of RQ) from the current round. 

Implementing permutations in hardware is trivial: Wires representing input bits 
can simply be connected to wires representing the output bits without using any gates or 
1 5 other logic. In software, however, performing bit-level permutation is a difficult task. 
Instruction set architectures for existing general-purpose microprocessors do not include 
instructions that can be used to efficiently complete such permutations. Performing a bit- 
level permutation of an n-bit register can require as many as 0(n) instructions on general- 
purpose RISC or CISC microprocessors. 
20 The round key computation function F is difficult to compute in software, for F 

involves a bit-level permutation. In practice, however, the function F is rarely executed. 
A DES secret key K will often be used to encrypt/decrypt hundreds, thousands, or even 
millions of bytes of data. The 16 round keys corresponding to a secret key K can be 
computed once prior to the encryption/decryption of all the 64-bit blocks of the data 
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rather than once for each 64-bit block. Hence, the computation cost of calculating F is 
usually negligible; the computation is amortized over the encryption/decryption of many 
data blocks. 

The S-boxes are usually implemented as lookup tables in software. We refer to 
5 the input to an S-box lookup table as the index, and we refer to the possible outputs of the 
S-box as the table entries. To eliminate processing associated with the P-box 
Permutation, software implementations of DES often combine the P-box Permutation 
with the S-box lookup tables to form SP-box lookup tables. Each S-box outputs a 4-bit 
value, but the SP-boxes output a 32-bit or larger value in which the 4-bit S-box result is 

1 0 already permuted per the P-box Permutation. Hence, no explicit run-time processing is 
needed to complete the P-box Permutation: The permutation is built into the eight SP-box 
lookup tables. If the SP-box outputs are 32 bits in size, the 28 output bits that do not 
represent bits of the original 4-bit S-box output are set to zeroes. The results of the eight 
SP-box outputs can be combined by performing seven bitwise logical XOR or bitwise 

1 5 logical OR operations following the eight SP-box table lookups. 



TABLE 1: DES EXPANSION PERMUTATION 
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20 Table 1 illustrates the DES Expansion Permutation. The Expansion Permutation 
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is easier to compute than the P-box Permutation in software, for most of the permutation 
involves 6-bit blocks rather than individual bits. Each entry represents a bit of the 48-.bit 
permutation output; the entries are arranged in increasmg order from left to right and then 
from top to bottom. The number in each entry indicates the location of the bit in the 32- 
5 bit input that is mapped to the output bit corresponding to that entry. For example, bit 2 
of the 32-bit input R(j) is mapped to bit 3 of the Expansion Permutation output, as the 
numeral 2 appears in the third entry. Similarly, bit 12 of R(j) is mapped to bits 17 and 19 
of the output. In straightforward software implementations of DBS, the Expansion 
Permutation requires very little computation. The inputs to six of the eight S-boxes are 

10 simply the XOR resuhs of six contiguous bits of R(j) with six bits of the round key RK(j). 
Preparing the inputs to the first and the eighth S-boxes requires additional special 
computation due to discontinuities in the Expansion Permutation. For example, the first 
six bits of the output of the Expansion Permutation, which are involved in the generation 
of the input to the first SP-box, include bits from both the right and the left ends of the 32- 

1 5 bit value R(j). These discontinuities are easy to handle. On many RISC processors, only 
one assembly instruction - if any at all - per DES round is required to complete this 
special computation. 

To improve performance of software implementations of DES for processors with 
large memory caches, the SP-box lookup tables can be combined in pairs to form four 
20 large lookup tables. By using four SP-box tables rather than eight, the number of table 
lookups per round is reduced from eight to four, and the number of bitwise XOR (or 
bitwise OR) operations required to combine the results is reduced from seven to three. 
Each of these large SP-box tables is indexed with a 12-bit input (from the combination of 
two 6-bit inputs) and contains 4096 entries. A table entry in a large SP-box 
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corresponding to a 12-bit index W, which is the concatenation of two 6-bit indexes U and 
V, is the result of bitwise XORing the entry corresponding to the index U in the first 
small table with the entry corresponding to the index V in the second small table. If the 
size of a table entry is 32 bits, the total size of all four tables is 64 kilobytes. Large SP- 
5 boxes do not reduce the number of instructions required to complete the Expansion 
Permutation and the round key XOR operations, however. 

SUMMARY OF THE INVENTION 

The present invention improves the throughput of software implementations of 
10 DES (that employ large SP-boxes) through mathematical transformations performed on 
the DES round function and on the DES round key computation fiinction F. More 
specifically, this invention enables an implementation of the DES round that requires 
fewer assembly instructions to complete than existing hnplementations. The invention 
furnishes this performance improvement by shifting computation associated with the 
1 5 Expansion Permutation in the DES round to the infrequently executed round key 
computation fimction F. 

The SP-box lookup tables are combined in pairs to produce four large SP-box 
lookup tables, and these four lookup tables are indexed as illustrated in Figs. 5 through 8. 
We refer to the 48-bit output of the Expansion Permutation as Y. The variable y, 
20 represents the ith bit of Y, where y, is the least significant bit and y48 is the most 

significant bit. Similarly, k, represents the ith bit of the round key RK(j), where ki is the 
least significant bit and k48 is the most significant bit. The symbol 0 represents the 
bitwise logical XOR operator. For example, the leftmost bit of the index to the first large 
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SP-box would be the XOR of the two bits y, and k,. 

Let us consider the 12-bit input the second large SP-box. We call this 12-bit input 
A. The DES round function essentially requires the same number and types of operations 
to produce each of SP-box inputs; the following transformations can be applied to all four 
5 of the large SP-boxes and their respective inputs. As shown in Fig. 6, A is the result of 
bitwise logically XORing certain bits of the round key with certain bits of Y as follows: 
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Since the Expansion Permutation is fixed, we can rewrite A as follows: 
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The 32-bit input to the Expansion Permutation is referred to herein as X. The 
20 variable x, represents the ith bit of X, where Xj is the least significant bit and X32 is the 
most significant bit. Since we are considering the input to the second large SP-box, the 
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bits of X correspond to the second row of the Table 1. It is seen that the Expansion 
Permutation maps a pair of bits, and X13, to two locations in Y. 

Described hereinafter are software implementations of DES using general-purpose 
RISC processor assembly instructions. A straightforward implementation of DES round 
5 fimction requires at least 7 simple RISC assembly instructions to compute A, the 12-bit 
input for the second large SP-box. Pseudo assembly for the computation of A is listed 
below: 



LOAD 


M 




->R] 


SHIFTR 


X, 


0x7 


->U 


SHIFTR 


X, 


0x5 


->V 


AND 


u. 


0x03F 


->u 


AND 


V, 


OxFCO 


->v 


XOR 


u. 
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-> Y 


XOR 


Y, 


RK 


->A 



The register X represents the 32-bit variable X, the register Y is used to store bits 
13 through 24 of the variable Y, and the register RK is used to store bits 13 through 24 of 
the round key RK(j). The registers U and V store temporary variables. In the first 

20 instruction, the appropriate 1 2-bit chunk of the round key is loaded fi-om memory address 
M. More specifically, bits 13 through 24 of RKQ) are loaded into the 12 least significant 
bits of RK. Depending on the mstruction set architecture, this round key loading 
operation may require more than one instruction. The next 5 instructions shift and isolate 
the appropriate 12 bits of X to generate bits 13 through 24 of Y. In the two SfflFTR 

25 instructions, the register X is logically right shifted by seven and five bits, and the results 
are stored in registers U and V, respectively. During the execution of the two AND 
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instructions, the processor performs bitwise AND operations on U and V to isolate bits 8 
through 13 of X (in register U) and bits 12 through 17 of X (in register V). After the 
completion of the first XOR instruction, the 12 least significant bits of register Y contain 
bits 13 through 24 of Y. In the last XOR instruction, register Y is bitwise logically 
5 XORed with register RK, and A, the input to the second large SP-box, is stored in the 12 
least significant bits of the register A. 

If more sophisticated assembly instructions are available, it is possible to achieve 
the result of the above assembly sequence using fewer instructions. For example, the five 
inner instructions can be replaced with 3 EXTRACT and DEPOSIT instructions. 
1 0 The present invention enables a reduction in the number of instructions required to 

compute the inputs for the four large SP-boxes. More specifically, this invention can be 
used to reduce the length of the previously described assembly code sequence firom seven 
instructions to four instructions. If EXTRACT and DEPOSIT are used, the number of 
instructions can be reduced from five to three with this invention. 

15 

BRIEF DESCRIPTION OF THE DRAWING 

The above and other objects and advantages of the present invention will become 
apparent firom the following description when read in conjunction with the accompanying 
drawing wherein: 

20 Fig. 1 is a simplified prior art illustration of DES encryption; 

Fig.. 2 is a simplified prior art illustration of DES decryption; 

Fig. 3 is a prior art flow diagram of the steps for DES encryption of a plaintext 

block; 

Fig. 4 is a prior art flow diagram of the steps for a single DES encryption round; 
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Figs. 5 through 8 are representations of index computations for the four large SP- 
boxes used in the invention; and 

Figs. 9 through 12 are representations of index computation for the RSP-boxes 
used in the invention. 

5 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT 

We begin by describing a new indexing scheme for the large SP-boxes. We refer 
to the new input into the second large SP-box as B. The variable h, represents the ith bit 
of B where bj is the least significant bit and b,2 is the most significant bit. B is calculated 
10 as follows: 
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The leftmost (i.e., least significant) 10 bits of B consist of all 12 bits of A except 
(x,3 © kjo) and (x,2 © k,,). We can infer the values of (x^ © k,,) and (x,2 ® k,,) by 
logically XORing bits of B together. By the properties of the XOR operator: 

20 (b,2 © he) = (k^o © k„) © (X,3 © k,8) = (X,3© kjo) 

(b„ © bs) = (k„ © k„) © (X,2 © k„) = (X,2 © k,,) 
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From these expressions, it can be concluded that B contains as least as much 
information as A. We can therefore obtain the correct output from the second large SP- 
box using the input B (rather than A) if we rearrange the large SP-box entries. This static 
rearrangement of SP-box entries occurs at or before software compile time, and therefore 
5 the computation required to reorder the entries does not affect DES encryption or 
decryption performance. 

To generate B, we require a single block of contiguous bits from X rather than two 
blocks of contiguous bits (as required to generate A). As a result, the present invention 
enables a software implementation of the DES round that employs fewer assembly 

10 instructions to compute B and the other three inputs to the rearranged large SP-boxes. We 
avoid the computation cost associated with generating the modified round keys required 
to compute the inputs to the rearranged SP-boxes by performing the necessary XOR and 
shift operations during the computation of F. The new F would output modified round 
keys that could be directly XORed with contiguoxis bits of X to properly index the 

1 5 rearranged SP-boxes. Since the fimction F is computed infrequently relative to the 

execution of the DES round, the computation cost of performing the additional XOR and 
shift operations is negligible. 

This invention mcludes all possible methods of generating indexes to rearranged 
SP-boxes (that each contain 4096 or more entries) in which a block of contiguous bits 

20 from X is directly XORed vdth a modified round key. The method used to rearrange the 
entries of the large SP-boxes depends on the method used to generate the new inputs to 
the SP-boxes, however. For example, B, the input to the second large rearranged SP-box 
could alternatively be calculated as follows: 
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5 

The same number of assembly instructions would be required to compute the 
original and alternative versions of B, but the SP-boxes would have to be rearranged 
differently in the two cases to guarantee the correct output of the DES round. 

Depending on the processor that executes the software, two blocks of contiguous 

10 bits of X (rather than one block) may be required to generate the input to the first large 
rearranged SP-box, the input to the fourth large rearranged SP-box, or the inputs to both 
the first and the fourth large rearranged SP-boxes. The Expansion Permutation maps the 
1 st and the 32nd bits of X to the inputs to the first and the fourth large (original and 
rearranged) SP-boxes. Since these bits may not be contiguous in a register that stores X 

15 or a fimction of X, two blocks of contiguous bits of X may need to be employed to 

generate the said SP-box inputs. This invention includes all methods of generating the 
input to a large rearranged SP-box using two blocks of contiguous bits of X if one block 
includes the first bit of the rightmost 32 bits of the input to the DES round and the other 
block includes the 32nd bit of the rightmost 32 bits of the input to the DES round. 

20 We now describe a sample implementation of the invention. The entries of four 

large SP-boxes are rearranged as follows. In each of the four SP-box lookup tables, each 
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of the 4096 possible 12-bit indexes corresponds to a different 32-bit entry. Let the 12-bit 
index be adcdefghijkl, where each letter represents a single bit: 
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The leftmost letter, a, represents the least significant bit of the index, and the 
rightmost letter, 1, represents the most significant bit. A 32-bit entry in an original large 
SP-box with corresponding index abcdefghijkl will be mapped to the entry in a rearranged 
RSP-box corresponding to the following index: 
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The rearranged large SP-boxes are hereinafter referred to as RSP-boxes. No 
entries are exchanged between different SP-boxes. All the entries fi-om the first large SP- 
box will be mapped to locations in the first RSP-box, all the entries firom the second large 
SP-box will be mapped to locations in the second RSP-box, etc. In addition, no 
1 5 modifications are made to the contents of the table entries; the rearrangement treats 

entries as atomic elements and simply involves permuting the entries within the SP-boxes 
as described above. 

In addition to making changes to the large SP-boxes, we also modify the round 
key computation fiinction F. The following bits of the round key need to be logically 
20 XORed with bits of X in order to generate B and properly index the second RSP-box: 
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As explained herein, using the present invention, we avoid the computation cost of 
preparing the modified round keys by performing the necessary operations when 
computing the roimd key computation function F. The appropriate block of the modified 
5 round key can then be directly loaded from memory and XORed with bits of X during the 
execution of the DES round. 

We now describe how to generate the four indexes for all four RSP-boxes. Figs. 9 
through 12 illustrate which bits of X need to be XORed with bits of the round key to 
generate the four indexes. While the invention has been described primarily in terms of 
10 XOR operations, this invention includes all possible methods of modifying or adding 
operations to the DES roimd key computation function that enable the novel SP-box 
indexing scheme described herein. 

Using the mathematical transformation of the present invention, software 
implementations of a DES round can employ contiguous blocks of bits from X when 
1 5 computing the indexes for the large SP-boxes. As a result, generating the input to an 
RSP-box requires only four conventional RISC assembly instructions rather than seven. 
Pseudo assembly for the computation of B, the index into the second RSP-box, is listed 
below: 



20 LOAD M ->RK 

SHIFTR X, 0x7 ->U 

AND U, OxSFF ->U 

XOR U, RK ->B 

25 The first instruction loads the modified round key from memory address M and 
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stores the result in register RK. The next two instructions place bits 8 through 17 of X in 
the least significant bits of register U. If instructions such as EXTRACT are available, 
these two instructions can be replaced with a single instruction. Lastly, the modified 
round key is XORed with bits 8 through 17 of X to generate the index B. 
5 Having thus disclosed a preferred embodiment of the invention, it is understood 

that modifications and variations may be readily carried out without deviating from the 
underlying inventive principles described herein, and that the invention is limited only by 
the appended claims and their equivalents. 
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